Measuring the distance between two data points is at the core of the K-Nearest Neighbors (KNN) algorithm, so it is important to understand the concept of distance between two vectors.
Imagine that each observation in a dataset lives in a P-dimensional space, where P is the number of predictors.
A general definition of the distance between two vectors is the Minkowski distance, which can be written as
\[\left ( \sum_{i=1}^{P}|A_i-B_i|^q \right )^{\frac{1}{q}},\] where \(q\) can take any positive value.
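To make the formula concrete, here is a minimal R sketch of a general Minkowski distance function (the function name minkowski is ours, for illustration only; it does not come from any package):
# Minkowski distance between two numeric vectors A and B for a given power q
minkowski <- function(A, B, q) {
  (sum(abs(A - B)^q))^(1/q)
}
Setting \(q = 1\) or \(q = 2\) recovers the two special cases discussed below.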
For the sake of simplicity, suppose that we have two observations and three predictors, and we observe the following values for the two observations on these three predictors.
Observation 1: (20,25,30)
Observation 2: (80,90,75)
If we assume that \(q=1\) for the Minkowski equation above, then we can calculate the distance as follows:
A <- c(20,25,30)
B <- c(80,90,75)
sum(abs(A - B))
[1] 170
When \(q\) is equal to 1 for the Minkowski equation, it becomes a special case and is known as Manhattan Distance. Manhattan Distance between these two data points is visualized below.
If we assume that \(q=2\) for the Minkowski equation above, then we can calculate the distance as follows:
A <- c(20,25,30)
B <- c(80,90,75)
(sum(abs(A - B)^2))^(1/2)
[1] 99.24717
When \(q\) is equal to 2 for the Minkowski equation, it is also a special case and is known as the Euclidean distance. The Euclidean distance between these two data points is visualized below.
Given \(N\) observations in a dataset, the distance between any observation and the remaining \(N-1\) observations can be calculated using the Minkowski distance (with a user-defined choice of the \(q\) value). Then, for any given observation, we can rank order the remaining observations based on how close they are to that observation and decide on the K nearest neighbors (\(K = 1, 2, 3, ..., N-1\)): the K observations closest to the given observation based on their distance.
Suppose that there are 10 observations measured on three predictor variables (X1, X2, and X3) with the following values.
d <- data.frame(x1 =c(20,25,30,42,10,60,65,55,80,90),
x2 =c(10,15,12,20,45,75,70,80,85,90),
x3 =c(25,30,35,20,40,80,85,90,92,95),
label= c('A','B','C','D','E','F','G','H','I','J'))
d
x1 x2 x3 label
1 20 10 25 A
2 25 15 30 B
3 30 12 35 C
4 42 20 20 D
5 10 45 40 E
6 60 75 80 F
7 65 70 85 G
8 55 80 90 H
9 80 85 92 I
10 90 90 95 J
Given that there are 10 observations, we can calculate the distance between all 45 possible pairs of observations (e.g., the Euclidean distance).
dist <- as.data.frame(t(combn(1:10,2)))
dist$euclidian <- NA
for(i in 1:nrow(dist)){
a <- d[dist[i,1],1:3]
b <- d[dist[i,2],1:3]
dist[i,]$euclidian <- sqrt(sum((a-b)^2))
}
dist
V1 V2 euclidian
1 1 2 8.660254
2 1 3 14.282857
3 1 4 24.677925
4 1 5 39.370039
5 1 6 94.074439
6 1 7 96.046864
7 1 8 101.734950
8 1 9 117.106789
9 1 10 127.279221
10 2 3 7.681146
11 2 4 20.346990
12 2 5 35.000000
13 2 6 85.586214
14 2 7 87.464278
15 2 8 93.407708
16 2 9 108.485022
17 2 10 118.638105
18 3 4 20.808652
19 3 5 38.910153
20 3 6 83.030115
21 3 7 84.196199
22 3 8 90.961530
23 3 9 105.252078
24 3 10 115.256236
25 4 5 45.265881
26 4 6 83.360662
27 4 7 85.170417
28 4 8 93.107465
29 4 9 104.177733
30 4 10 113.265176
31 5 6 70.710678
32 5 7 75.332596
33 5 8 75.828754
34 5 9 95.937480
35 5 10 107.004673
36 6 7 8.660254
37 6 8 12.247449
38 6 9 25.377155
39 6 10 36.742346
40 7 8 15.000000
41 7 9 22.338308
42 7 10 33.541020
43 8 9 25.573424
44 8 10 36.742346
45 9 10 11.575837
Now, for instance, we can find the three closest observations to Point E (its 3-Nearest Neighbors). As seen below, the 3-Nearest Neighbors of Point E in this dataset are Point B, Point C, and Point A.
# Point E is the fifth observation in the dataset
loc <- which(dist[,1]==5 | dist[,2]==5)
tmp <- dist[loc,]
tmp[order(tmp$euclidian),]
V1 V2 euclidian
12 2 5 35.00000
19 3 5 38.91015
4 1 5 39.37004
25 4 5 45.26588
31 5 6 70.71068
32 5 7 75.33260
33 5 8 75.82875
34 5 9 95.93748
35 5 10 107.00467
The \(q\) in the Minkowski distance equation and the \(K\) in the K-nearest neighbors are user-defined hyperparameters of the KNN algorithm. As a researcher and model builder, you can pick any values for \(q\) and \(K\). They can be tuned using an approach similar to the one applied in earlier classes for regularized regression models: pick a set of values for these hyperparameters and apply a grid search to find the combination that provides the best predictive performance.
It is typical to observe overfitting (high model variance, low model bias) for small values of K and underfitting (low model variance, high model bias) for large values of K. In general practice, people tend to focus their grid search for K around \(\sqrt{N}\).
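As a rough sketch of what such a grid might look like (the dataset size N and the candidate values below are made up purely for illustration):
# Hypothetical example: with N = 1000 observations, sqrt(N) is about 32
N <- 1000
round(sqrt(N))    # 32

# A made-up tuning grid with K centered near sqrt(N) and a few q values
hyper_grid <- expand.grid(k = seq(22, 42, by = 4),
                          q = c(1, 2, 3))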
It is important to keep in mind that the distance calculation between two observations is highly dependent on the scale of measurement of the predictor variables. If predictors are on different scales, the distance metric will be dominated by differences in the predictors with the largest scales, which is not ideal. Therefore, it is important to center and scale all predictors before running the KNN algorithm so that each predictor contributes to the distance calculation in the same way.
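As a small illustration of the mechanics (using the toy dataset d from above; its three predictors already happen to be on similar scales, so this only shows how standardization would be done, not a dramatic change in the distances):
# Euclidean distance between observations A and B on the raw scale
sqrt(sum((d[1, 1:3] - d[2, 1:3])^2))

# Center and scale each predictor first, then compute the same distance;
# after standardization, no predictor dominates merely because of its scale
d_scaled <- scale(d[, 1:3])
sqrt(sum((d_scaled[1, ] - d_scaled[2, ])^2))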
Now that we have covered how to calculate distances and how to identify the K-nearest neighbors based on a distance metric, prediction with KNN is simple and straightforward.
Below is a list of steps for predicting an outcome for a given observation.
1. Calculate the distance between the observation and the remaining \(N-1\) observations in the data (with a user-defined choice of \(q\) in the Minkowski distance).
2. Rank order the observations based on the calculated distances and choose the K-nearest neighbors (with a user-defined choice of \(K\)).
3. Calculate the mean of the observed outcome among the K-nearest neighbors as your prediction.
Note that Step 3 applies regardless of the type of outcome. If the outcome variable is continuous, we take the average outcome of the K-nearest neighbors as our prediction. If the outcome variable is binary (e.g., 0 vs. 1), then the proportion of each class observed among the K-nearest neighbors yields the predicted probability of each class.
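Here is a toy sketch of Step 3 using the small 10-observation dataset from above; the outcome vectors y and y_bin below are invented purely for illustration:
# Hypothetical outcomes for the 10 observations (A through J)
y     <- c(5, 7, 6, 8, 10, 20, 22, 25, 24, 26)
y_bin <- c(0, 1, 0, 0, 1, 1, 1, 0, 1, 1)

# The 3-nearest neighbors of Point E (observation 5) found earlier: B, C, and A
neighbors <- c(2, 3, 1)

# Continuous outcome: the prediction is the mean outcome of the neighbors
mean(y[neighbors])                   # (7 + 6 + 5)/3 = 6

# Binary outcome: the predictions are the class proportions among the neighbors
prop.table(table(y_bin[neighbors]))  # 0: 2/3, 1: 1/3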
Below, I provide an example for both types of outcome using the Readability and Recidivism datasets.
The code below is identical to the code we used in earlier classes for data preparation of the Readability dataset. Note that this is only to demonstrate the logic of model predictions in the context of K-nearest neighbors, so we are using the whole dataset. In the next section, we will demonstrate the full workflow of model training and tuning with 10-fold cross-validation using the caret::train() function.
# Import the dataset
readability <- read.csv('https://raw.githubusercontent.com/uo-datasci-specialization/c4-ml-fall-2021/main/data/readability_features.csv',header=TRUE)
# Remove the variables with more than 80% missingness
require(finalfit)
missing_ <- ff_glimpse(readability)$Continuous
flag_na <- which(as.numeric(missing_$missing_percent) > 80)
readability <- readability[,-flag_na]
# Write the recipe
require(recipes)
blueprint_readability <- recipe(x = readability,
vars = colnames(readability),
roles = c(rep('predictor',990),'outcome')) %>%
step_zv(all_numeric()) %>%
step_nzv(all_numeric()) %>%
step_impute_mean(all_numeric()) %>%
step_normalize(all_numeric_predictors()) %>%
step_corr(all_numeric(),threshold=0.9)
# Apply the recipe
baked_read <- blueprint_readability %>%
prep(training = readability) %>%
bake(new_data = readability)

Our final dataset (baked_read) has 2834 observations and 888 columns (887 predictors, and the last column is the target outcome). Now, suppose that we would like to predict the readability score for the first observation. The code below will calculate the Minkowski distance (with \(q=2\)) between the first observation and each of the remaining 2833 observations by using the first 887 columns of the dataset (the predictors).
dist <- data.frame(obs = 2:2834,dist = NA,target=NA)
for(i in 1:2833){
a <- as.matrix(baked_read[1,1:887])
b <- as.matrix(baked_read[dist[i,1],1:887])
dist[i,]$dist <- sqrt(sum((a-b)^2))
dist[i,]$target <- baked_read[dist[i,1],]$target
#print(i)
}

We now rank order the observations from closest to the most distant, and then choose the 20-nearest observations (K=20). Finally, we calculate the average of the observed outcome for these 20-nearest neighbors; this average becomes our prediction of the readability score for the first observation.
# Rank order the observations from closest to the most distant
dist <- dist[order(dist$dist),]
# Check the 20-nearest neighbors
dist[1:20,]
obs dist target
2439 2440 23.39269 0.55897492
2417 2418 25.35221 -0.21279072
2311 2312 25.90366 -0.25321371
2347 2348 26.15836 -0.15932546
2483 2484 26.28408 -0.92538206
2015 2016 26.54107 0.13989288
2318 2319 26.72020 -1.48029561
1569 1570 26.87505 -1.13367793
43 44 26.92854 -0.58635946
13 14 26.97686 0.24580571
2262 2263 27.27469 -0.90345299
117 118 27.28308 -0.20640822
2529 2530 27.31948 -0.32031046
1132 1133 27.54420 0.38387363
123 124 27.69614 -0.09360404
2152 2153 27.70797 -1.11412508
124 125 27.82294 -0.36537882
2267 2268 27.83390 -0.23113845
2520 2521 27.85822 -0.63588777
1243 1244 27.86323 0.70560033
# Mean target for the 20-nearest observations
mean(dist[1:20,]$target)
[1] -0.3293602
# Check the actual observed value of reability for the first observation
readability[1,]$target
[1] -0.3402591
We can follow the same procedure to predict recidivism in the second year after an individual's initial release from prison.
# Import data
recidivism <- read.csv('https://raw.githubusercontent.com/uo-datasci-specialization/c4-ml-fall-2021/main/data/recidivism_y1 removed.csv',header=TRUE)
# Write the recipe
# List of variable types
outcome <- c('Recidivism_Arrest_Year2')
categorical <- c('Residence_PUMA',
'Prison_Offense',
'Age_at_Release',
'Supervision_Level_First',
'Education_Level',
'Prison_Years',
'Gender',
'Race',
'Gang_Affiliated',
'Prior_Arrest_Episodes_DVCharges',
'Prior_Arrest_Episodes_GunCharges',
'Prior_Conviction_Episodes_Viol',
'Prior_Conviction_Episodes_PPViolationCharges',
'Prior_Conviction_Episodes_DomesticViolenceCharges',
'Prior_Conviction_Episodes_GunCharges',
'Prior_Revocations_Parole',
'Prior_Revocations_Probation',
'Condition_MH_SA',
'Condition_Cog_Ed',
'Condition_Other',
'Violations_ElectronicMonitoring',
'Violations_Instruction',
'Violations_FailToReport',
'Violations_MoveWithoutPermission',
'Employment_Exempt')
numeric <- c('Supervision_Risk_Score_First',
'Dependents',
'Prior_Arrest_Episodes_Felony',
'Prior_Arrest_Episodes_Misd',
'Prior_Arrest_Episodes_Violent',
'Prior_Arrest_Episodes_Property',
'Prior_Arrest_Episodes_Drug',
'Prior_Arrest_Episodes_PPViolationCharges',
'Prior_Conviction_Episodes_Felony',
'Prior_Conviction_Episodes_Misd',
'Prior_Conviction_Episodes_Prop',
'Prior_Conviction_Episodes_Drug',
'Delinquency_Reports',
'Program_Attendances',
'Program_UnexcusedAbsences',
'Residence_Changes',
'Avg_Days_per_DrugTest',
'Jobs_Per_Year')
props <- c('DrugTests_THC_Positive',
'DrugTests_Cocaine_Positive',
'DrugTests_Meth_Positive',
'DrugTests_Other_Positive',
'Percent_Days_Employed')
# Convert all nominal, ordinal, and binary variables to factors
# Leave the rest as is
for(i in categorical){
recidivism[,i] <- as.factor(recidivism[,i])
}
# For variables that represent proportions, add/subtract a small number
# to 0s/1s for logit transformation
for(i in props){
recidivism[,i] <- ifelse(recidivism[,i]==0,.0001,recidivism[,i])
recidivism[,i] <- ifelse(recidivism[,i]==1,.9999,recidivism[,i])
}
# Blueprint for processing variables
require(recipes)
blueprint_recidivism <- recipe(x = recidivism,
vars = c(categorical,numeric,props,outcome),
roles = c(rep('predictor',48),'outcome')) %>%
step_indicate_na(all_of(categorical),all_of(numeric),all_of(props)) %>%
step_zv(all_numeric()) %>%
step_impute_mean(all_of(numeric),all_of(props)) %>%
step_impute_mode(all_of(categorical)) %>%
step_logit(all_of(props)) %>%
step_ns(all_of(numeric),all_of(props),deg_free=3) %>%
step_normalize(paste0(numeric,'_ns_1'),
paste0(numeric,'_ns_2'),
paste0(numeric,'_ns_3'),
paste0(props,'_ns_1'),
paste0(props,'_ns_2'),
paste0(props,'_ns_3')) %>%
step_dummy(all_of(categorical),one_hot=TRUE) %>%
step_num2factor(Recidivism_Arrest_Year2,
transform = function(x) x + 1,
levels=c('No','Yes'))
# Apply the recipe
baked_recidivism <- blueprint_recidivism %>%
prep(training = recidivism) %>%
bake(new_data = recidivism)

The final dataset (baked_recidivism) has 18,111 observations and 166 columns (165 predictors, and the first column is the outcome variable). Now, suppose that we would like to predict the probability of recidivism for the first individual. The code below will calculate the Minkowski distance (with \(q=2\)) between the first individual and each of the remaining 18,110 individuals by using the values of the 165 predictors in this dataset.
dist2 <- data.frame(obs = 2:18111,dist = NA,target=NA)
for(i in 1:18110){
a <- as.matrix(baked_recidivism[1,2:166])
b <- as.matrix(baked_recidivism[dist2[i,1],2:166])
dist2[i,]$dist <- sqrt(sum((a-b)^2))
dist2[i,]$target <- as.character(baked_recidivism[dist2[i,1],]$Recidivism_Arrest_Year2)
#print(i)
}

We now rank order the individuals from closest to the most distant, and then choose the 100-nearest observations (K=100). Then, we calculate the proportion of individuals who recidivated (Yes) and did not recidivate (No) among these 100-nearest neighbors. These proportions are the predicted probabilities of recidivating or not recidivating for the first individual.
# Rank order the observations from closest to the most distant
dist2 <- dist2[order(dist2$dist),]
# Check the 100-nearest neighbors
dist2[1:100,]
obs dist target
7068 7069 7.659967 No
4525 4526 7.755362 No
8444 8445 7.859353 No
11530 11531 8.081124 Yes
9730 9731 8.090887 No
6662 6663 8.480201 No
4041 4042 8.484186 No
595 596 8.538511 No
645 646 8.558983 Yes
9111 9112 8.663452 No
1696 1697 8.687675 No
1770 1771 8.697517 No
2158 2159 8.742357 No
1027 1028 8.765372 Yes
5116 5117 8.775667 No
3689 3690 8.806331 No
829 830 8.821060 No
8139 8140 8.839652 Yes
384 385 8.856510 Yes
8467 8468 8.861272 No
7313 7314 8.874215 No
8766 8767 8.890606 No
4093 4094 8.927379 No
11385 11386 8.942428 Yes
2713 2714 8.949004 No
4422 4423 8.956050 Yes
1515 1516 8.963143 Yes
13840 13841 9.022972 No
1574 1575 9.051930 No
14383 14384 9.055670 No
8186 8187 9.067094 No
13020 13021 9.068971 No
9070 9071 9.073520 No
6355 6356 9.097278 No
3966 3967 9.108941 Yes
13820 13821 9.109367 No
2947 2948 9.109936 No
894 895 9.116890 No
11091 11092 9.117779 No
14202 14203 9.122793 No
6022 6023 9.135977 No
3434 3435 9.141237 Yes
11015 11016 9.153111 Yes
9946 9947 9.166294 No
12473 12474 9.174083 No
563 564 9.179573 Yes
1572 1573 9.187760 No
7785 7786 9.188086 No
807 808 9.199478 Yes
14783 14784 9.205327 Yes
15152 15153 9.205441 No
10816 10817 9.206677 Yes
3029 3030 9.221680 No
986 987 9.224630 No
2946 2947 9.231919 No
14505 14506 9.241374 No
13910 13911 9.250168 No
5306 5307 9.254097 Yes
4870 4871 9.278566 Yes
14526 14527 9.278633 Yes
4794 4795 9.282708 No
1561 1562 9.287491 No
9025 9026 9.287869 Yes
10167 10168 9.289060 No
2 3 9.305465 No
288 289 9.307759 Yes
599 600 9.336480 No
4778 4779 9.353857 No
4166 4167 9.367567 No
7297 7298 9.371710 No
4644 4645 9.374108 No
2299 2300 9.375541 No
1447 1448 9.386387 No
9676 9677 9.399002 Yes
9134 9135 9.412505 No
4583 4584 9.420181 Yes
7958 7959 9.425067 No
10761 10762 9.425995 Yes
7201 7202 9.437474 No
5314 5315 9.445364 No
1917 1918 9.448661 Yes
15125 15126 9.455830 Yes
3170 3171 9.460053 No
1388 1389 9.461795 Yes
9286 9287 9.465960 No
615 616 9.468258 No
548 549 9.469860 No
4229 4230 9.475141 No
4743 4744 9.476835 No
3641 3642 9.481367 No
4874 4875 9.481934 Yes
9727 9728 9.485522 No
15326 15327 9.486961 No
9381 9382 9.489079 Yes
3362 3363 9.489238 Yes
8641 8642 9.490666 Yes
2745 2746 9.491828 No
5741 5742 9.493817 No
1578 1579 9.496209 No
8963 8964 9.515531 No
# Class frequencies among the 100-nearest observations
table(dist2[1:100,]$target)
No Yes
70 30
# This indicates that the predicted probability of being recidivated is 0.30
# for the first individual given the observed data for 100 most similar
# observations

# Check the actual observed outcome for the first individual
recidivism[1,]$Recidivism_Arrest_Year2
[1] 0
In the previous section, we tried to understand how KNN predicts a target outcome by simply averaging the observed values of the target outcome for the K-nearest neighbors. This was a simple average that weighted each neighbor equally.
Another way of averaging the target outcome across the K-nearest neighbors is to weight each neighbor according to its distance and calculate a weighted average. A simple way to weight each neighbor is to use the inverse of its distance. For instance, consider the earlier example where we found the 20-nearest neighbors for the first observation in the readability dataset.
dist <- dist[order(dist$dist),]
k_neighbors <- dist[1:20,]
k_neighbors
obs dist target
2439 2440 23.39269 0.55897492
2417 2418 25.35221 -0.21279072
2311 2312 25.90366 -0.25321371
2347 2348 26.15836 -0.15932546
2483 2484 26.28408 -0.92538206
2015 2016 26.54107 0.13989288
2318 2319 26.72020 -1.48029561
1569 1570 26.87505 -1.13367793
43 44 26.92854 -0.58635946
13 14 26.97686 0.24580571
2262 2263 27.27469 -0.90345299
117 118 27.28308 -0.20640822
2529 2530 27.31948 -0.32031046
1132 1133 27.54420 0.38387363
123 124 27.69614 -0.09360404
2152 2153 27.70797 -1.11412508
124 125 27.82294 -0.36537882
2267 2268 27.83390 -0.23113845
2520 2521 27.85822 -0.63588777
1243 1244 27.86323 0.70560033
We can assign a weight to each neighbor by taking the inverse of its distance and rescaling these values such that the sum of the weights is equal to 1.
k_neighbors$weight <- 1/k_neighbors$dist
k_neighbors$weight <- k_neighbors$weight/sum(k_neighbors$weight)
k_neighbors
obs dist target weight
2439 2440 23.39269 0.55897492 0.05732923
2417 2418 25.35221 -0.21279072 0.05289814
2311 2312 25.90366 -0.25321371 0.05177203
2347 2348 26.15836 -0.15932546 0.05126793
2483 2484 26.28408 -0.92538206 0.05102271
2015 2016 26.54107 0.13989288 0.05052867
2318 2319 26.72020 -1.48029561 0.05018993
1569 1570 26.87505 -1.13367793 0.04990073
43 44 26.92854 -0.58635946 0.04980163
13 14 26.97686 0.24580571 0.04971241
2262 2263 27.27469 -0.90345299 0.04916958
117 118 27.28308 -0.20640822 0.04915446
2529 2530 27.31948 -0.32031046 0.04908897
1132 1133 27.54420 0.38387363 0.04868847
123 124 27.69614 -0.09360404 0.04842137
2152 2153 27.70797 -1.11412508 0.04840069
124 125 27.82294 -0.36537882 0.04820069
2267 2268 27.83390 -0.23113845 0.04818172
2520 2521 27.85822 -0.63588777 0.04813964
1243 1244 27.86323 0.70560033 0.04813100
Then, we can compute a weighted average of the target scores instead of a simple average.
# Weighted Mean target for the 20-nearest observations
sum(k_neighbors$target*k_neighbors$weight)
[1] -0.3239415
There are a number of different kernel functions that can be used to assign weights to the K-nearest neighbors (e.g., epanechnikov, quartic, triweight, tricube, gaussian, cosine). For all of them, the closest neighbors are assigned higher weights, and the weight gets smaller as the distance increases; they differ slightly in how they assign these weights. Below is a demonstration of how the assigned weight changes as a function of distance for different kernel functions.
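As a rough sketch of that idea, the snippet below evaluates the textbook formulas for a few common kernels on a distance rescaled to the [0, 1] range and plots them; this is only an approximation of how such kernels behave, not an extract of the kknn package's internal code:
# Standard kernel formulas applied to a distance rescaled to [0, 1]
d_scaled <- seq(0, 1, length.out = 101)

w_rectangular  <- rep(1, length(d_scaled))       # same weight regardless of distance
w_epanechnikov <- 0.75 * (1 - d_scaled^2)        # quadratic decay toward zero
w_triweight    <- (35/32) * (1 - d_scaled^2)^3   # decays faster than epanechnikov
w_gaussian     <- dnorm(d_scaled)                # smooth decay, never exactly zero

plot(d_scaled, w_triweight, type = 'l', lty = 1, ylim = c(0, 1.2),
     xlab = 'Rescaled distance', ylab = 'Weight')
lines(d_scaled, w_epanechnikov, lty = 2)
lines(d_scaled, w_gaussian, lty = 3)
lines(d_scaled, w_rectangular, lty = 4)
legend('topright', lty = 1:4,
       legend = c('triweight', 'epanechnikov', 'gaussian', 'rectangular'))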
Which kernel function should we use for weighting the distance? The type of kernel function can also be considered as a hyperparameter to tune.
caret::train() function

The KNN algorithm can become computationally quite intensive, and a grid search with many combinations may take days, particularly when you have a very large sample size. One way to reduce the computational time and make it reasonable is to use parallel processing when you have access to a multi-core computer (or computing clusters).
The caret package is designed to take advantage of multiple cores in a computer via another package, doParallel. If you are worried about the computation time for fitting any model using the caret::train function, you can run the following lines of code to use as many cores as are available on your computing device.
require(doParallel)
ncores <- 15 # depends on the number of cores available in your computer
cl <- makePSOCKcluster(ncores)
registerDoParallel(cl)

Once you run these lines of code, anytime you run the caret::train function to train any model, it will use all registered cores to do the computations. This could significantly reduce the computational time from days to hours, or from hours to minutes.
After you are done with training the model, you can run the following line of code to stop the computational cluster registered for the use of your current R session.
stopCluster(cl)

You can find more information about parallel processing with the caret package at this link.
In this section, we implement the KNN algorithm to build a prediction model for the readability score. The first part of the code is identical to the code we used in earlier classes for the following steps:
# Train and Test Split
set.seed(10152021) # for reproducibility
loc <- sample(1:nrow(readability), round(nrow(readability) * 0.9))
read_tr <- readability[loc, ]
read_te <- readability[-loc, ]
# Create the row indices for 10-folds
# Randomly shuffle the training data
read_tr = read_tr[sample(nrow(read_tr)),]
# Create 10 folds with equal size
folds = cut(seq(1,nrow(read_tr)),breaks=10,labels=FALSE)
# Create the list for each fold
my.indices <- vector('list',10)
for(i in 1:10){
my.indices[[i]] <- which(folds!=i)
}
# Cross-validation settings
cv <- trainControl(method = "cv",
index = my.indices)

We will use the kknn package to implement the KNN algorithm, and this method is available through the caret::train() function. First, let’s check which hyperparameters are available to tune for this method.
# install.package('kknn')
require(caret)
require(kknn)
getModelInfo()$kknn$parameters
parameter class label
1 kmax numeric Max. #Neighbors
2 distance numeric Distance
3 kernel character Kernel
There are three hyperparameters available to optimize for the kknn method. These are the same parameters discussed earlier:
- kmax: the choice of K for the K-nearest neighbors (numeric)
- distance: the choice of the power value (\(q\)) for the Minkowski distance (numeric)
- kernel: the choice of kernel function for weighted predictions (character)

The possible kernel functions available are “rectangular”, “triangular”, “epanechnikov”, “biweight”, “triweight”, “cos”, “inv”, and “gaussian”. The “rectangular” option indicates a simple average with no weights applied (equally weighted).
The next step is to create a matrix for hyperparameter grid search. Let’s consider the following values for these three hyperparameters:
- kmax: 3, 4, 5, …, 25
- distance: 1, 2, 3
- kernel: rectangular, epanechnikov

# Hyperparameter Tuning Grid
grid <- expand.grid(kmax = 3:25,
distance = c(1,2,3),
kernel = c('epanechnikov','rectangular'))
grid
kmax distance kernel
1 3 1 epanechnikov
2 4 1 epanechnikov
3 5 1 epanechnikov
4 6 1 epanechnikov
5 7 1 epanechnikov
6 8 1 epanechnikov
7 9 1 epanechnikov
8 10 1 epanechnikov
9 11 1 epanechnikov
10 12 1 epanechnikov
11 13 1 epanechnikov
12 14 1 epanechnikov
13 15 1 epanechnikov
14 16 1 epanechnikov
15 17 1 epanechnikov
16 18 1 epanechnikov
17 19 1 epanechnikov
18 20 1 epanechnikov
19 21 1 epanechnikov
20 22 1 epanechnikov
21 23 1 epanechnikov
22 24 1 epanechnikov
23 25 1 epanechnikov
24 3 2 epanechnikov
25 4 2 epanechnikov
26 5 2 epanechnikov
27 6 2 epanechnikov
28 7 2 epanechnikov
29 8 2 epanechnikov
30 9 2 epanechnikov
31 10 2 epanechnikov
32 11 2 epanechnikov
33 12 2 epanechnikov
34 13 2 epanechnikov
35 14 2 epanechnikov
36 15 2 epanechnikov
37 16 2 epanechnikov
38 17 2 epanechnikov
39 18 2 epanechnikov
40 19 2 epanechnikov
41 20 2 epanechnikov
42 21 2 epanechnikov
43 22 2 epanechnikov
44 23 2 epanechnikov
45 24 2 epanechnikov
46 25 2 epanechnikov
47 3 3 epanechnikov
48 4 3 epanechnikov
49 5 3 epanechnikov
50 6 3 epanechnikov
51 7 3 epanechnikov
52 8 3 epanechnikov
53 9 3 epanechnikov
54 10 3 epanechnikov
55 11 3 epanechnikov
56 12 3 epanechnikov
57 13 3 epanechnikov
58 14 3 epanechnikov
59 15 3 epanechnikov
60 16 3 epanechnikov
61 17 3 epanechnikov
62 18 3 epanechnikov
63 19 3 epanechnikov
64 20 3 epanechnikov
65 21 3 epanechnikov
66 22 3 epanechnikov
67 23 3 epanechnikov
68 24 3 epanechnikov
69 25 3 epanechnikov
70 3 1 rectangular
71 4 1 rectangular
72 5 1 rectangular
73 6 1 rectangular
74 7 1 rectangular
75 8 1 rectangular
76 9 1 rectangular
77 10 1 rectangular
78 11 1 rectangular
79 12 1 rectangular
80 13 1 rectangular
81 14 1 rectangular
82 15 1 rectangular
83 16 1 rectangular
84 17 1 rectangular
85 18 1 rectangular
86 19 1 rectangular
87 20 1 rectangular
88 21 1 rectangular
89 22 1 rectangular
90 23 1 rectangular
91 24 1 rectangular
92 25 1 rectangular
93 3 2 rectangular
94 4 2 rectangular
95 5 2 rectangular
96 6 2 rectangular
97 7 2 rectangular
98 8 2 rectangular
99 9 2 rectangular
100 10 2 rectangular
101 11 2 rectangular
102 12 2 rectangular
103 13 2 rectangular
104 14 2 rectangular
105 15 2 rectangular
106 16 2 rectangular
107 17 2 rectangular
108 18 2 rectangular
109 19 2 rectangular
110 20 2 rectangular
111 21 2 rectangular
112 22 2 rectangular
113 23 2 rectangular
114 24 2 rectangular
115 25 2 rectangular
116 3 3 rectangular
117 4 3 rectangular
118 5 3 rectangular
119 6 3 rectangular
120 7 3 rectangular
121 8 3 rectangular
122 9 3 rectangular
123 10 3 rectangular
124 11 3 rectangular
125 12 3 rectangular
126 13 3 rectangular
127 14 3 rectangular
128 15 3 rectangular
129 16 3 rectangular
130 17 3 rectangular
131 18 3 rectangular
132 19 3 rectangular
133 20 3 rectangular
134 21 3 rectangular
135 22 3 rectangular
136 23 3 rectangular
137 24 3 rectangular
138 25 3 rectangular
There are a total of 138 combinations we would like to search. Note that this means a total of 1,380 models will be fitted (10-fold cross-validation for each of the 138 combinations). Also, the KNN algorithm is slow to fit because you have to calculate N*(N-1)/2 distances, so this may take a very long time. Therefore, I will also use parallel processing with 15 cores.
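As a rough back-of-the-envelope check of that cost (using the roughly 90% training split created above, about 2,551 observations):
# Approximate number of pairwise distances needed for a single fit
N <- round(2834 * 0.9)   # size of the training set
N * (N - 1) / 2          # roughly 3.25 million distance calculations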
require(doParallel)
ncores <- 15 # depends on the number of cores available in your computer
cl <- makePSOCKcluster(ncores)
registerDoParallel(cl)

Now, we can fit the model with these settings.
# Train the model
caret_knn_readability <- caret::train(blueprint_readability,
data = read_tr,
method = "kknn",
trControl = cv,
tuneGrid = grid)
caret_knn_readability$times
$everything
user system elapsed
208.17 5.86 11534.15
$final
user system elapsed
186.16 0.50 186.25
$prediction
[1] NA NA NA
The training took 11534.15 seconds (~ 3 hours 12 minutes) with 15 cores running at the same time (!).
Let’s check the best combination of hyperparameters that optimizes the 10-fold cross validated performance metric (RMSE).
plot(caret_knn_readability)

caret_knn_readability$bestTune
kmax distance kernel
115 22 1 epanechnikov
Now, we can check the performance of the KNN algorithm on the test dataset.
predicted_te <- predict(caret_knn_readability,read_te)
# R-square
cor(read_te$target,predicted_te)^2
[1] 0.6226773
# RMSE
sqrt(mean((read_te$target - predicted_te)^2))
[1] 0.6297403
# MAE
mean(abs(read_te$target - predicted_te))
[1] 0.4998969
The table below provides a comparison to the models we discussed earlier.
| | R-square | MAE | RMSE |
|---|---|---|---|
| Linear Regression | 0.644 | 0.522 | 0.644 |
| Ridge Regression | 0.727 | 0.435 | 0.536 |
| Lasso Regression | 0.725 | 0.434 | 0.538 |
| KNN | 0.623 | 0.500 | 0.629 |
You can ask for the variable importance plot from the KNN algorithm similar to other models (although I am not sure how the variable importance is calculated in the context of KNN).
require(vip)
vip(caret_knn_readability,
num_features = 10,
geom = "point") +
theme_bw()